Biostatistics & Epidemiology — Part 2

Validity, Inference & Time‑to‑Event

Austin Meyer, MD, PhD, MS, MPH, MS

2025-10-24

Lecture Overview

What We’ll Cover Today

  • Part 1: Measurement That Matters
    • Differentiate reliability (repeatability) and validity (truth)
    • Common reliability types: test–retest, internal consistency, inter‑rater
  • Part 2: Hypothesis Testing in the Wild
    • What a p‑value means and when to reject H0
    • Power and Type I/II errors (planning studies to avoid mistakes)
  • Part 3: Choosing the Right Statistical Test
    • When to use ANOVA vs chi‑square, with clinical examples
  • Part 4: Clinical Impact & Precision
    • ARR → NNT and interpreting 95% confidence intervals
  • Part 5: Time‑to‑Event Outcomes
    • Kaplan–Meier curves, censoring, and log‑rank interpretation

How We’ll Work Through Questions

  • Present a clinical vignette and answer options
  • 20–30 seconds of think time (pair‑share)
  • Reveal the answer and debrief with 2–3 teaching points
  • Where helpful, we’ll add a simple visual to anchor the concept



Part 1: Measurement That Matters

Reliability & Validity

Question 1: Reliability vs Validity

A physician is meeting with a group of fellows for their weekly research symposium, and the topic of discussion is reliability and validity. They discuss a scenario in which a hospital is testing a new depression screening tool. One goal is to compare use of the tool between residents and attending physicians, and to evaluate the degree to which the 2 groups agree when independently scoring patients as positive for depression.

Of the following, this scenario refers to the research concept of:

Debrief: Reliability vs. Validity

Key Takeaway: Reliability is about consistency (getting the same result repeatedly), while validity is about truthfulness (measuring the right thing).

Three Types of Reliability

When Would Other Answers Be Correct?

  • A. Content Validity: This assesses whether a test covers all relevant aspects of the concept it claims to measure. It’s about the test’s content, not about agreement between raters.

  • B. Internal Consistency Reliability: This measures how well different items on the same test correlate with each other (e.g., do all questions on a depression screener measure the same underlying construct?). It doesn’t involve different raters.

  • D. Predictive Validity: This assesses how well a test predicts a future outcome (e.g., do high scores on the depression screener predict a future diagnosis of major depressive disorder?).
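Inter-rater reliability is commonly quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch in Python, using entirely hypothetical resident vs. attending screening results:

```python
# Minimal sketch of inter-rater agreement via Cohen's kappa.
# All ratings below are invented for illustration.

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on binary (0/1) labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal positive rate
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

resident  = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]  # 1 = screened positive
attending = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
kappa = cohens_kappa(resident, attending)  # raw agreement 80%, kappa ~0.58
```

Kappa below raw agreement is expected: some of that 80% agreement would occur by chance alone.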

Question 2: Generalizability (External Validity)

A hospitalist works at a large, urban hospital that serves a diverse pediatric patient population with a high burden of severe asthma and atopic disease. For an upcoming journal club, the hospitalist division will review a recently published randomized controlled trial. The trial concludes that a new medication for asthma exacerbations, “BtrBreth,” improves asthma symptoms and decreases length of hospital stay as compared to standard asthma therapy. The study was a randomized controlled trial conducted at 20 small pediatric hospitals in rural England with a total sample size of 1,500 patients. The researchers included children with intermittent asthma admitted for an asthma exacerbation and excluded any with persistent asthma, allergies, eczema, or other comorbidities.

Of the following, the application of this study to the hospitalist’s patients is MOST limited by a lack of:

Debrief: Generalizability

Key Takeaway: A study’s results are only useful if they can be applied to your specific patient population (external validity).

Internal vs External Validity Matrix

When Would Other Answers Be Correct?

  • A. Causality: The study is a randomized controlled trial (RCT), which is the gold standard for establishing causality. The limitation is not in determining if the drug caused the outcome within the study, but if that causal relationship applies elsewhere.

  • C. Internal Validity: As an RCT, the study likely has high internal validity, meaning its conclusions about the studied population are probably sound. The problem is not the study’s internal rigor but its external applicability.

  • D. Power: With a sample size of 1,500, the study is likely well-powered to detect a difference if one exists within its specific population.

Part 2: Hypothesis Testing in the Wild

Question 3: Interpreting a p‑value

Investigators in a recent study assessed admission rates for acute croup. They hypothesized that there would be a significant difference in admission rates among children who received one treatment with nebulized epinephrine compared with children who received multiple treatments of nebulized epinephrine. They sought to detect a difference of 20% in admission rates with a predefined α of .05 and 80% power to reach statistical significance. Using a large inpatient database, they identified 80,000 children who received one dose of nebulized epinephrine and 8,000 children who received more than one dose of nebulized epinephrine for acute croup. Admission rates were 10% in the former group and 70% in the latter group (P = .01).

Of the following, the statement that BEST reflects the results of this study is:

Debrief: Interpreting a p-value

Key Takeaway: A p-value below alpha allows you to reject the null hypothesis, but it doesn’t describe the effect’s size or clinical importance.

Understanding the p-value

When Would Other Answers Be Correct?

  • A. Causality: This is an observational study. The children receiving multiple doses of epinephrine were likely much sicker to begin with. The treatment didn’t cause the admission; the underlying severity of their illness did. This is a classic example of confounding by indication.

  • C. Odds Ratio: While the odds ratio could be calculated from the data, the p-value itself does not provide this information. A p-value only speaks to the statistical significance of the finding, not the magnitude of the effect.

  • D. Underpowered: The study included 88,000 children and found a highly significant p-value (p=.01), so it was not underpowered.
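As bullet C notes, the odds ratio conveys the magnitude that the p-value cannot. A minimal sketch in Python, with counts derived from the vignette's stated rates (10% of 80,000 single-dose vs 70% of 8,000 multi-dose children admitted):

```python
# Sketch: odds ratio from the vignette's counts. The p-value alone
# says nothing about this magnitude of association.

def odds_ratio(exposed_events, exposed_total, control_events, control_total):
    """Odds of the outcome in the exposed group over the control group."""
    odds_exposed = exposed_events / (exposed_total - exposed_events)
    odds_control = control_events / (control_total - control_events)
    return odds_exposed / odds_control

# Multiple doses: 70% of 8,000 admitted; single dose: 10% of 80,000 admitted
or_admission = odds_ratio(5_600, 8_000, 8_000, 80_000)  # about 21
```

An odds ratio of about 21 is enormous, which is exactly why confounding by indication is the more plausible explanation than a causal drug effect.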

Question 4: Power & Type II Error

A clinical randomized controlled trial is performed to test the efficacy of a new type of corticosteroid inhaler as compared with “usual care” in preventing exacerbations of asthma in children. Enough patients are enrolled to provide a statistical power of 70%. Asthma severity, as measured by means of a validated score, was reduced by 35% in the intervention group and by 28% in the usual care group (p = .13). Using the common criterion of p < .05 as the cutoff point for significance, the authors conclude that there is no statistical difference between intervention and control groups.

Of the following, the BEST estimate of the probability that the authors have reached their conclusion in error is:

Debrief: Power & Type II Error

Key Takeaway: Power is your study’s ability to find a real effect; low power means a high chance of a false negative (a Type II error).

Understanding Statistical Power

When Would Other Answers Be Correct?

  • A. 5%: This is the value of α (alpha), the probability of a Type I error. A Type I error can only occur if you reject the null hypothesis. Since the authors failed to reject the null, a Type I error is not the concern here.

  • B. 7%: This might seem tempting because it matches the observed difference between groups (35% - 28% = 7 percentage points), but this is simply the magnitude of the effect observed in the study. The probability of a Type II error (failing to reject the null when a true effect exists) is determined by the study’s power. With 70% power, β = 1 - 0.70 = 30%.

  • C. 13%: This is the p-value (0.13), which tells us the probability of observing these data (or more extreme) if the null hypothesis were true. The p-value is not the probability that the authors’ conclusion is an error. When we fail to reject the null, the relevant error probability is the Type II error rate (β), which equals 30% in this study (calculated as 1 - power).
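The relationship power = 1 − β can be made concrete with a normal-approximation power calculation. A sketch in Python, treating the 35% vs 28% responses as proportions; the per-group n of 540 is hypothetical, chosen so power lands near the vignette's 70% (the actual enrollment is not stated):

```python
# Sketch (normal approximation): power of a two-sided two-sample
# proportion test at alpha = .05. Group sizes are hypothetical.
from math import sqrt, erf

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power_two_proportions(p1, p2, n_per_group):
    """Approximate power to detect p1 vs p2 with n subjects per group."""
    z_crit = 1.96  # two-sided critical value at alpha = .05
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    return normal_cdf(abs(p1 - p2) / se - z_crit)

# Hypothetical n chosen so power comes out near the vignette's 70%
power = power_two_proportions(0.35, 0.28, n_per_group=540)
beta = 1 - power  # Type II error rate, roughly 30% here
```

Increasing n shrinks the standard error, which raises power and shrinks β; that is the lever study planners actually control.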

Question 5: Type I vs Type II Errors

A new drug is being studied for the treatment of eosinophilic esophagitis. It is a formulation of budesonide in individual pouches with a ready-made viscous solution that eliminates the need for the patient or parent to mix budesonide liquid vials before ingestion. The study randomized 50 children between the ages of 4 and 14 years with eosinophilic esophagitis into 2 medical therapy groups. They compared the results of esophageal biopsies from the diagnostic endoscopy to a subsequent endoscopy after 3 to 5 months on either formulation of budesonide. The study failed to find a significant difference in the number of eosinophils per high-power field (HPF) in those who used the new drug compound compared with those who used the budesonide vials mixed with a sucralose-based artificial sweetener.

Of the following, the MOST accurate statement regarding this study is that:

Debrief: Type I vs. Type II Errors

Key Takeaway: Failing to find a difference in a small study doesn’t mean there isn’t one; you may have simply made a Type II error.

Type I and Type II Error Decision Tree

When Would Other Answers Be Correct?

  • A. Alternative Hypothesis Supported: The study failed to find a significant difference, so the alternative hypothesis (that there is a difference) was not supported.

  • B. Type I Error: A Type I error (false positive) occurs when you reject the null hypothesis. Since the study failed to reject the null, a Type I error is not the risk.

  • D. Null Hypothesis Rejected: The opposite is true. The study failed to reject the null hypothesis, leading to the conclusion of “no significant difference.”

Part 3: Choosing the Right Statistical Test

Question 6: Testing Groups

A pediatric nephrologist is studying factors that affect phosphorus levels in children with end-stage renal disease (ESRD). The nephrologist has collected laboratory data on all children with ESRD over a 5-year period from 6 dialysis centers serving a large, urban, diverse community. Fifty percent of the children were identified as Hispanic, 30% as African-American, 10% as white, and 10% did not have race/ethnicity information available.

Of the following, the BEST statistical approach to determine whether mean serum phosphorus levels differ by race/ethnicity in children with ESRD in this cohort is:

Debrief: ANOVA for Comparing Means

Key Takeaway: Use ANOVA to compare the means of a continuous variable (e.g., phosphorus levels) across three or more groups.

ANOVA: Comparing Means Across Groups

When Would Other Answers Be Correct?

  • B. Chi-square: This test is used for comparing proportions or frequencies of categorical data (e.g., comparing the proportion of patients in each ethnic group who are above a certain phosphorus threshold), not for comparing continuous means.

  • C. Linear Regression: While regression could model phosphorus levels with ethnicity as a predictor, ANOVA is the more direct and standard statistical test for specifically asking whether the means of a continuous variable differ across several categories.

  • D. Paired Sample t-test: This is used to compare the means of two related groups (e.g., measuring phosphorus levels in the same patients before and after an intervention). It is not suitable for comparing four independent groups.
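The ANOVA F statistic is the ratio of between-group variance to within-group variance; a large F means group means differ more than within-group noise would explain. A minimal sketch in Python with hypothetical phosphorus values (mg/dL) for three groups:

```python
# Sketch: one-way ANOVA F statistic computed by hand.
# Phosphorus values below are invented for illustration.

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [
    [5.1, 5.9, 6.2, 5.6],  # group A (hypothetical)
    [6.8, 7.1, 6.5, 7.0],  # group B
    [5.0, 5.4, 4.8, 5.2],  # group C
]
f_stat = one_way_anova_f(groups)  # large F -> means likely differ
```

In practice the F statistic is compared against an F distribution with (k − 1, n − k) degrees of freedom to obtain a p-value.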

Question 7: Testing Proportions

A pediatric hospitalist is interested in studying asthma readmissions in her hospital. The hospital admits a large number of patients with asthma. She wants to compare the proportion of asthma readmissions in patients discharged with an asthma action plan compared to asthma patients who are discharged without an asthma action plan.

Of the following, the MOST appropriate statistical test for this study is:

Debrief: Chi-Square for Proportions

Key Takeaway: Use a Chi-Square test to compare the proportions of a categorical outcome (e.g., readmitted vs. not) between two or more groups.

Chi‑Square Test: Comparing Proportions

                  Readmitted   Not Readmitted   Total
Action Plan           25             175          200
No Action Plan        45             155          200
Total                 70             330          400

Chi‑Square Test Calculation:

  • Observed readmission rates: 12.5% (Action Plan) vs 22.5% (No Plan)
  • χ² compares observed counts to expected counts under H0 (no difference)
  • Use Fisher’s exact test when any expected cell count is < 5
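The observed-vs-expected comparison in the bullets above can be sketched in a few lines of Python, using the 2×2 table shown:

```python
# Sketch: chi-square statistic for the 2x2 readmission table,
# comparing observed counts to expected counts under H0.

def chi_square_2x2(table):
    """table = [[a, b], [c, d]] of observed counts; returns chi-square."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

table = [[25, 175],   # Action Plan: readmitted, not readmitted
         [45, 155]]   # No Action Plan
chi2 = chi_square_2x2(table)  # df = 1 for a 2x2 table; ~6.93 here
```

A chi-square of about 6.93 on 1 degree of freedom corresponds to p < .01, consistent with the action plan being associated with fewer readmissions.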

When Would Other Answers Be Correct?

  • A. Analysis of Variance (ANOVA): This is used for comparing the means of a continuous outcome (like blood pressure) across two or more groups. The outcome here is categorical (readmitted vs. not readmitted).

  • C. McNemar Test: This is used for paired or matched categorical data, such as a “before-and-after” study on the same individuals (e.g., did the proportion of patients with a positive attitude change after receiving an action plan?). The groups here (with vs. without a plan) are independent.

  • D. Paired t-test: This is used for a continuous outcome in paired groups. This study has a categorical outcome and independent groups.

Part 4: Clinical Impact & Precision

Question 8: ARR and NNT

A new medication has been studied for use in patients with juvenile arthritis. In polyarticular rheumatoid factor–positive juvenile idiopathic arthritis, it has been shown to reduce the 2-year risk of joint erosion from 30% to 10%.

Of the following, the NUMBER of patients needed to be treated with this medication to decrease the number of children with joint erosion by 1 is:

Debrief: Calculating NNT

Key Takeaway: NNT translates risk reduction into an intuitive number: how many people you need to treat to prevent one bad outcome.

Visualizing ARR and NNT

Alternative Calculation: ARR = 0.30 − 0.10 = 0.20 → NNT = 1/0.20 = 5

When Would Other Answers Be Correct?

  • A. 3: An NNT of 3 would imply an Absolute Risk Reduction (ARR) of 1/3 or 33.3%. The ARR here is 20%.

  • C. 15: This might result from incorrectly dividing the control risk by 2 (30% ÷ 2 = 15), or from taking the midpoint of the treatment risk (10%) and the ARR (20%). However, NNT is always calculated as 1/ARR; with an ARR of 20% (0.20), the correct calculation is 1 ÷ 0.20 = 5, not 15.

  • D. 20: This is the ARR (20%), not the NNT. The NNT is the reciprocal of the ARR (1 / 0.20).
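The ARR → NNT arithmetic can be sketched in a few lines of Python, using the vignette's risks (30% control, 10% treated):

```python
# Sketch: absolute risk reduction and number needed to treat
# from the vignette's risks (30% control, 10% treated).

def arr_and_nnt(control_risk, treatment_risk):
    """ARR = risk difference; NNT = 1/ARR (round up in practice)."""
    arr = control_risk - treatment_risk
    return arr, 1 / arr

arr, nnt = arr_and_nnt(0.30, 0.10)  # ARR 0.20 -> NNT 5
```

Because NNT is the reciprocal of ARR, small absolute risk reductions translate into large NNTs even when relative risk reductions sound impressive.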

Question 9: Confidence Intervals

A researcher is reviewing a study on adolescents with celiac disease (CD). The goal of the study is to determine whether adherence to a gluten-free diet differs based on the method of diagnosis—endoscopic biopsy or serologic markers alone. The outcome measure is the serum tissue transglutaminase (TTG) level 1 year after diagnosis and initiation of the gluten-free diet. A serum immunoglobulin (Ig) A TTG value of less than 15 was selected as normal.

The results of the study show the mean TTG level 1 year after diagnosis:

Biopsy-Confirmed: TTG 15 U/mL (95% CI: 10-20 U/mL)
Serology-Confirmed: TTG 30 U/mL (95% CI: 18-42 U/mL)

Of the following, the BEST interpretation of these results is that:

Debrief: Understanding Confidence Intervals

Key Takeaway: A 95% CI provides a range for the true population mean, not a range containing 95% of individual patients.

Comparing Mean TTG Levels by Diagnostic Method

Key Point: The CI tells us where the population mean likely falls, not where 95% of individuals fall.

When Would Other Answers Be Correct?

  • A. and D.: This is a common misinterpretation. The 95% CI is a range for the population mean, not a range containing 95% of the individual subjects’ data points. The range of individual values would be much wider.

  • C.: While non-overlapping confidence intervals suggest a statistically significant difference, the CI itself doesn’t give a “95% likelihood” of that difference. It’s a statement about the precision of the mean estimate. The proper way to assess significance is with a formal hypothesis test (like a t-test), which would yield a p-value.
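The link between sample size and CI width can be sketched in Python. The SD and n below are hypothetical, chosen only so the biopsy group's reported CI of roughly 10–20 U/mL is reproduced (the study reports only means and CIs):

```python
# Sketch: 95% CI for a mean under the normal approximation.
# The SD and n are hypothetical, chosen to roughly reproduce the
# biopsy group's reported CI of 10-20 U/mL around a mean of 15.
from math import sqrt

def ci95_for_mean(mean, sd, n):
    """Return (lower, upper) of the 95% CI for a sample mean."""
    se = sd / sqrt(n)   # standard error of the mean
    margin = 1.96 * se  # normal-approximation multiplier
    return mean - margin, mean + margin

low, high = ci95_for_mean(15.0, sd=25.5, n=100)    # ~ (10, 20)
low4, high4 = ci95_for_mean(15.0, sd=25.5, n=400)  # 4x the n -> half the width
```

Quadrupling n halves the CI width because the standard error scales with 1/√n, which is why larger studies give more precise estimates of the population mean.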

Part 5: Time‑to‑Event Outcomes

Question 10: Survival Analysis

A researcher is designing a study on how long it takes Epstein-Barr virus (EBV) IgG-negative pediatric transplant recipients to convert to EBV IgG-positive status after receiving a kidney from an EBV-positive donor. There will be 2 groups; one receives a 90-day course of antiviral therapy after transplant and another receives a 365-day course of antiviral therapy after transplant.

Of the following, the BEST statistical approach is:

Debrief: Survival Analysis

Key Takeaway: For “time-to-event” outcomes, use survival analysis (like Kaplan-Meier) to properly account for time and for patients who leave the study (censoring).

Kaplan–Meier Survival Curves

When Would Other Answers Be Correct?

  • A. ANOVA / C. t-test: These tests compare means at a single point in time. They cannot handle time-to-event data or account for censoring (when a patient leaves the study before the event occurs). They would ignore crucial information about when the seroconversion happened.

  • D. χ² test: This test could compare the proportion of patients who seroconverted by a certain deadline (e.g., by 1 year), but it loses all the information about the timing of conversions that occurred before that deadline. Survival analysis uses all the time-based data.
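The Kaplan–Meier estimator itself is simple: at each event time, multiply the running survival probability by the fraction of at-risk patients who did not have the event; censored subjects just leave the risk set without triggering a step. A minimal sketch in Python on hypothetical seroconversion data:

```python
# Sketch: Kaplan-Meier estimate on toy time-to-seroconversion data.
# "Survival" here means still EBV IgG-negative. Data are hypothetical.

def kaplan_meier(times, events):
    """Return [(time, survival)] after each time with >= 1 event."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, ei in zip(times, events)
                       if ti == t and ei == 1)
        if n_events:
            survival *= (at_risk - n_events) / at_risk
            curve.append((t, survival))
        at_risk -= sum(1 for ti in times if ti == t)  # events + censored leave
    return curve

times  = [30, 60, 60, 120, 200]  # days to seroconversion or censoring
events = [1, 1, 0, 1, 0]         # 1 = seroconverted, 0 = censored
curve = kaplan_meier(times, events)
```

Note how the subject censored at day 60 still counts in the risk set for that day's event but contributes nothing afterward; a χ² test of "converted by 1 year" would discard exactly this timing information.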

Summary

Key Takeaways

  • Reliability ≠ Validity: Reliability is consistency; validity is measuring what you intend to measure.

  • Generalizability matters: High internal validity doesn’t help if results don’t apply to your population.

  • p‑values show data compatibility with H0: Report effect sizes and confidence intervals for clinical meaning.

  • Plan for power: Low power (high β) inflates false negatives (Type II errors).

  • Match the test to the data: ANOVA for comparing >2 means; χ² for comparing proportions.

  • Report ARR/NNT: Absolute risk reduction and number needed to treat convey clinical impact.

  • 95% CI precision: Narrow CIs come from larger samples; CIs tell us about population parameters.

  • Survival analysis for time‑to‑event: Use Kaplan–Meier curves when outcomes unfold over time.

Remember: The goal is not just to know statistics, but to apply them wisely in clinical practice.